Subvert `insert_overwrite` merge strategy to bring back merge #371

elyobo · 2022-11-02T08:43:28Z

Resolves #16

Description

For comment only to see if there's any interest in the idea, not necessarily this particular implementation.

The insert overwrite implementation uses BigQuery scripting variables to extract store the partition values that will be updated, allowing efficient pruning of the insert query with substantial cost and performance benefits. This works well when those values are small enough to fit inside BQ scripting's memory limit of 1 MiB as described in #16, which I guess is going to be hit at 16,384 INT64 IDs.

Prior to the current insert overwrite strategy, @jtcohen6 worked on a similar approach that used minimum and maximum partition values instead in dbt-labs/dbt-core#1971, which works around the memory limit because only two values will be selected. It's not as efficient (unnecessary partitions between the min and the max will still be read), but the worst case performance is that it's no better than the merge + temp table cost (in which case users can opt to use the merge strategy) and in the best case (a contiguous block of IDs, e.g. in append only data) then it will prune partitions as well as the insert overwrite.

This PR just tinkers with the insert overwrite to bring back this approach as an optional variant, but it's probably confusing UX to have this very-much-not-an-insert-overwrite inside the "insert overwrite" strategy, and it has no tests - I made it to confirm that it was possible and that it worked in practice not just in my head, and it did offer significant improvements for me.

If there's interest in bringing this back I'm happy to work on an improved version that addresses these and any other issues raised.

Alternatives Considered

Users can also implement this themselves (to a large extent, at least; I had SQL header issues) by implementing their own bigquery__get_merge_sql. I added one in #16 (comment) as an example which is working for me although, as mentioned, I had SQL header issues - they're output to build the temp table and I can't seem to suppress them, so they're output again for the real select.

It would be more convenient, simple, and less error prone if this was supported directly.

Users can also fall back to the existing merge approach, but this can be much more expensive.

Checklist

I have read the contributing guide and understand what's expected of me
I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have opened an issue to add/update docs, or docs changes are not required/relevant for this PR
I have run changie new to create a changelog entry

cla-bot · 2022-11-02T08:43:36Z

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @elyobo

elyobo · 2022-11-02T08:56:27Z

Happy to sign the CLA if there's interest in getting this merged bot :) I did have a minor PR in dbt-utils, different CLA requirements there?

elyobo · 2022-11-24T00:46:56Z

Would appreciate any feedback on the idea here; it's a valuable cost saver for BigQuery users where the insert/overwrite strategy doesn't work due to memory limits in BQ "scripts" e.g. as described in #16.

elyobo · 2023-02-09T22:54:47Z

Implementation being discussed in #16 but this doesn't look like the way we'll be going so closing.

Subvert insert_overwrite merge strategy to bring back merge

2227352

Fleid mentioned this pull request Jan 30, 2023

[CT-2051] [Bug] New insert_overwrite Bigquery partitioning with integer keys can create huge temporary array variables, exceeding BQ limits #16

Open

5 tasks

Fleid added the awaiting_response label Jan 30, 2023

elyobo closed this Feb 9, 2023

elyobo deleted the merge-inside-insert-overwrite branch February 9, 2023 22:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subvert `insert_overwrite` merge strategy to bring back merge #371

Subvert `insert_overwrite` merge strategy to bring back merge #371

elyobo commented Nov 2, 2022

cla-bot bot commented Nov 2, 2022

elyobo commented Nov 2, 2022

elyobo commented Nov 24, 2022

elyobo commented Feb 9, 2023

Subvert insert_overwrite merge strategy to bring back merge #371

Subvert insert_overwrite merge strategy to bring back merge #371

Conversation

elyobo commented Nov 2, 2022

Description

Alternatives Considered

Checklist

cla-bot bot commented Nov 2, 2022

elyobo commented Nov 2, 2022

elyobo commented Nov 24, 2022

elyobo commented Feb 9, 2023

Subvert `insert_overwrite` merge strategy to bring back merge #371

Subvert `insert_overwrite` merge strategy to bring back merge #371